Capstone Project: Create a Customer Segmentation Report for Arvato Financial Services

In this project, you will analyze demographics data for customers of a mail-order sales company in Germany, comparing it against demographics information for the general population. You'll use unsupervised learning techniques to perform customer segmentation, identifying the parts of the population that best describe the core customer base of the company. Then, you'll apply what you've learned on a third dataset with demographics information for targets of a marketing campaign for the company, and use a model to predict which individuals are most likely to convert into becoming customers for the company. The data that you will use has been provided by our partners at Bertelsmann Arvato Analytics, and represents a real-life data science task.

If you completed the first term of this program, you will be familiar with the first part of this project from the unsupervised learning project. The versions of those two datasets used in this project include many more features and have not been pre-cleaned. You are also free to choose whatever approach you'd like for analyzing the data rather than following pre-determined steps. In your work on this project, make sure that you carefully document your steps and decisions, since your main deliverable for this project will be a blog post reporting your findings.

Part 0: Get to Know the Data

There are four data files associated with this project:

Each row of the demographics files represents a single person, but also includes information outside of individuals, including information about their household, building, and neighborhood. Use the information from the first two files to figure out how customers ("CUSTOMERS") are similar to or differ from the general population at large ("AZDIAS"), then use your analysis to make predictions on the other two files ("MAILOUT"), predicting which recipients are most likely to become a customer for the mail-order company.

The "CUSTOMERS" file contains three extra columns ('CUSTOMER_GROUP', 'ONLINE_PURCHASE', and 'PRODUCT_GROUP'), which provide broad information about the customers depicted in the file. The original "MAILOUT" file included one additional column, "RESPONSE", which indicated whether or not each recipient became a customer of the company. For the "TRAIN" subset, this column has been retained, but in the "TEST" subset it has been removed; it is against that withheld column that your final predictions will be assessed in the Kaggle competition.

Otherwise, all of the remaining columns are the same between the three data files. For more information about the columns depicted in the files, you can refer to two Excel spreadsheets provided in the workspace. One of them is a top-level list of attributes and descriptions, organized by informational category. The other is a detailed mapping of data values for each feature in alphabetical order.

In the below cell, we've provided some initial code to load in the first two datasets. Note that all of the .csv data files in this project are semicolon (;) delimited, so an additional argument has been included in the read_csv() call to read in the data properly. Also, considering the size of the datasets, it may take some time for them to load completely.
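As a minimal sketch of why the delimiter argument matters (using a tiny inline string in place of the real files, and illustrative column names), compare the default comma parsing against `sep=';'`:

```python
import io
import pandas as pd

# The project's .csv files are semicolon-delimited, so pd.read_csv needs sep=';'.
# This inline sample stands in for the real files; LNR is the ID column,
# the other column names are illustrative.
raw = "LNR;AGER_TYP;ALTERSKATEGORIE_GROB\n910215;-1;2\n910220;1;3\n"

wrong = pd.read_csv(io.StringIO(raw))           # everything lands in one column
right = pd.read_csv(io.StringIO(raw), sep=';')  # three separate columns

print(wrong.shape[1], right.shape[1])  # 1 3
```

With the real files you would pass the workspace path instead of the `StringIO` object, e.g. `pd.read_csv('path/to/azdias.csv', sep=';')`.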

You'll notice when the data is loaded in that a warning message will immediately pop up. Before you really start digging into the modeling and analysis, you're going to need to perform some cleaning. Take some time to browse the structure of the data and look over the informational spreadsheets to understand the data values. Make some decisions on which features to keep, which features to drop, and if any revisions need to be made on data formats. It'll be a good idea to create a function with pre-processing steps, since you'll need to clean all of the datasets before you work with them.
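A cleaning function along these lines keeps the pre-processing reusable across all three datasets. This is only a sketch: the real notebook derives the per-feature unknown codes and the columns to drop from the two informational spreadsheets, whereas here a single global tuple and an explicit drop list are simplifying assumptions.

```python
import numpy as np
import pandas as pd

def clean_data(df, unknown_codes=(-1, 0), drop_cols=()):
    """Sketch of a pre-processing step: recode 'unknown' values to NaN and
    drop columns flagged during review of the attribute spreadsheets.
    In the real data each feature has its own unknown codes."""
    df = df.copy()
    df = df.drop(columns=[c for c in drop_cols if c in df.columns])
    df = df.replace(list(unknown_codes), np.nan)
    return df

# toy demonstration with made-up columns
toy = pd.DataFrame({'A': [1, -1, 3], 'B': [0, 2, 2], 'DROP_ME': [9, 9, 9]})
cleaned = clean_data(toy, drop_cols=['DROP_ME'])
print(cleaned.isna().sum().sum())  # 2
```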

Instantiate PreProcessor and Prepare the data

Create an instance of Documentation

Create an instance of FeatureEngineer

Create an instance of the PreProcessor class

Take a look at AZDIAS before and after preprocessing

PCA

Explained variance

We want to choose the top n principal components so that they contain at least 90% of the data variance. So we will have a look at the explained variance first...
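The selection rule can be sketched as follows: fit a full PCA, take the cumulative sum of the explained variance ratios, and pick the smallest n that reaches the 90% threshold. The random matrix here only stands in for the cleaned, scaled AZDIAS data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 40))   # stand-in for the scaled AZDIAS matrix

pca = PCA().fit(X)
cumvar = np.cumsum(pca.explained_variance_ratio_)

# smallest n whose cumulative explained variance reaches 90%
n_components = int(np.argmax(cumvar >= 0.90)) + 1
print(n_components)
```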

The plot of the cumulative sum also has three marked points:

Plot the explained variance ratio of the first 50 components

We will choose N=142 PCA components

Determine optimum k

We want to determine the optimum k, so we first define a few helper functions.

Final PCA Transformer

We want to capture 90% of the data variance, so we create a PCA transformer with N=142 components.
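A sketch of the final transformer (a random matrix again stands in for the cleaned, scaled AZDIAS data; N=142 is the component count chosen above):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 150))   # placeholder for the real feature matrix

# Fit once on the general-population data, then reuse the same transformer
# for CUSTOMERS and MAILOUT so all datasets share one component space.
pca = PCA(n_components=142)
X_pca = pca.fit_transform(X)
print(X_pca.shape)  # (300, 142)
```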

Elbow Method

To save computing time, we use a subsample of azdias to determine the optimum k.

Save the result

Create List of KMeans-Estimators

Create a list of tuples (k, fitted_KMeans_model_for_k) so that the fitted estimators can be reused when calculating the KMeans score or the silhouette score, saving computing time.
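A sketch of this pattern, with a random matrix standing in for the PCA-transformed subsample and a hypothetical range of candidate k values:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))   # stand-in for the PCA-transformed subsample

# Fit each candidate k once and keep the fitted estimator, so it can be
# reused for both the SSE (elbow) and the silhouette score.
kmeans_array = [(k, KMeans(n_clusters=k, n_init=10, random_state=42).fit(X))
                for k in range(2, 11)]

# KMeans.score returns the *negative* SSE, so negate it for the elbow plot.
sse = [(k, -model.score(X)) for k, model in kmeans_array]
```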

Calculate KMeans-Score (SSE)

We calculate the KMeans score by iterating through our previously generated list 'kmeans_array' and calling the score method of each contained KMeans estimator. Note that scikit-learn's score method returns the negative SSE, so we negate it before plotting.

Elbow-Method

We plot the kmeans scores to find the elbow.

⇒ the elbow is not obvious so we will use further tools. We will examine the silhouette score.

Examine the silhouette score

$$ \mathrm{silhouette} = \frac{b - a}{\max(a, b)} $$

‘The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation)’ (https://en.wikipedia.org/wiki/Silhouette_(clustering)). It is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample.

The values are in the interval [-1, 1]. ‘The silhouette score of 1 means that the clusters are very dense and nicely separated. The score of 0 means that clusters are overlapping. The score of less than 0 means that data belonging to clusters may be wrong/incorrect’ (https://dzone.com/articles/kmeans-silhouette-score-explained-with-python-exam)
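A small illustration of these bounds, using synthetic blobs in place of the project data: two well-separated clusters should score close to 1, overlapping ones near 0.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Two clearly separated blobs -> dense, well-separated clusters.
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10]],
                  cluster_std=1.0, random_state=42)
labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

score = silhouette_score(X, labels)
print(score > 0.5)  # True
```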

⇒ a high silhouette score is desirable

Store the result

Plot the data.

⇒ we use k=17 as the optimum k

Part 1: Customer Segmentation Report

The main bulk of your analysis will come in this part of the project. Here, you should use unsupervised learning techniques to describe the relationship between the demographics of the company's existing customers and the general population of Germany. By the end of this part, you should be able to describe parts of the general population that are more likely to be part of the mail-order company's main customer base, and which parts of the general population are less so.

Train the final KMeans predictor

Use the optimum k to fit the final K-Means predictor

Now we can visualize the cluster assignments for AZDIAS and CUSTOMERS by plotting bar charts...

But absolute values make the clusters hard to compare. So we will plot the proportional cluster assignments...
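The normalisation step can be sketched like this (the cluster labels here are made up; in the notebook they come from calling predict on the PCA-transformed data):

```python
import pandas as pd

# Hypothetical cluster predictions for the two populations.
azdias_clusters = pd.Series([0, 0, 1, 2, 2, 2])
customer_clusters = pd.Series([2, 2, 2, 1, 2, 0])

# normalize=True turns raw counts into proportions, which makes the two
# differently sized populations directly comparable.
proportions = pd.DataFrame({
    'AZDIAS': azdias_clusters.value_counts(normalize=True),
    'CUSTOMERS': customer_clusters.value_counts(normalize=True),
}).fillna(0).sort_index()

# Clusters where CUSTOMERS is over-represented relative to the population:
overrepresented = proportions[proportions['CUSTOMERS'] > proportions['AZDIAS']]
print(overrepresented.index.tolist())  # [2]
```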

⇒ we will investigate clusters 2 and 8 from CUSTOMERS and compare each with clusters [2, 3, 5, 6] from AZDIAS.

Centroids in Component Space

We want to get an impression of how many of the PCA components to use. Therefore, we plot a heatmap of the cluster centroids vs. the PCA components.

PCA factor loadings

We will plot the factor loadings of the first 6 PCA components to see how they are made up by the original features.
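The loadings matrix is just the transpose of `pca.components_` indexed by the original feature names. A sketch with placeholder features (the real names come from the cleaned AZDIAS columns):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
feature_names = [f'FEATURE_{i}' for i in range(8)]  # placeholders for real columns
X = rng.normal(size=(100, 8))

pca = PCA(n_components=6).fit(X)

# Rows = original features, columns = principal components; each value is
# the weight (loading) of that feature in that component.
loadings = pd.DataFrame(pca.components_.T,
                        index=feature_names,
                        columns=[f'PC{i + 1}' for i in range(6)])

# Features with the largest absolute weight in the first component:
top = loadings['PC1'].abs().sort_values(ascending=False).head(3)
print(top.index.tolist())
```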

Analysis - Results

After looking at the factor loadings, we create a list of distinct features ordered by their weight in the factor loadings.

Analysis

We can do a lot of things with the data to gain valuable information from the clustering. What we will do is compare the two most customer-dominated clusters against the clusters dominated by AZDIAS. These clusters are 7 and 8:

First we have to do some preprocessing again:

Next we determine the relevant features. We take the first five principal components and consider the important features we got from the factor loadings. The number of features to consider per component was determined by reviewing the plots of the factor loadings:

Let's transform the clustered dataframes back into the original space, keeping the cluster number.
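The back-transformation chains the inverse of each fitted step in reverse order. A sketch, assuming a StandardScaler was applied before the PCA (with all components kept, the round trip is lossless; with N=142 out of more features it is only an approximation):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))

scaler = StandardScaler().fit(X)
pca = PCA(n_components=10).fit(scaler.transform(X))  # all components -> lossless

X_pca = pca.transform(scaler.transform(X))

# Undo both transformations to get back to the original feature space.
X_back = scaler.inverse_transform(pca.inverse_transform(X_pca))
print(np.allclose(X, X_back))  # True
```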

Before starting the comparison, let's look at the feature GEBURTSJAHR of the clusters dominated by CUSTOMERS versus the clusters dominated by AZDIAS:

Next, let's do the same, but now we compare the CUSTOMERS clusters separately against AZDIAS. We start with cluster number 7:

And now CUSTOMERS cluster number 8 vs. AZDIAS...

Conclusion:

That comparison was very informative. Next we evaluate the two CUSTOMERS clusters individually.

Evaluation 1 (Cluster 8)

Result

Here are the observations:

Evaluation 2 (Cluster 7)

Result

⇒ Cluster 7 contains mainly couples and multiperson households

Part 2: Supervised Learning Model

Now that you've found which parts of the population are more likely to be customers of the mail-order company, it's time to build a prediction model. Each of the rows in the "MAILOUT" data files represents an individual that was targeted for a mailout campaign. Ideally, we should be able to use the demographic information from each individual to decide whether or not it will be worth it to include that person in the campaign.

The "MAILOUT" data has been split into two approximately equal parts, each with almost 43,000 data rows. In this part, you can verify your model with the "TRAIN" partition, which includes a column, "RESPONSE", that states whether or not a person became a customer of the company following the campaign. In the next part, you'll need to create predictions on the "TEST" partition, where the "RESPONSE" column has been withheld.

⇒ The ratio between the majority class '0' and the minority class '1' is roughly 99:1, which means we have a severe class imbalance.

Tune Benchmark (Random Forest)

See also https://stackoverflow.com/questions/20463281/how-do-i-solve-overfitting-in-random-forest-of-python-sklearn .

Tune Extratrees Classifier

see also https://machinelearningmastery.com/extra-trees-ensemble-with-python/

Eval on validation set

Tune XGBOOST

Step 1: Find learning_rate, scale_pos_weight and n_estimators

We will start with initial versions of learning_rate and n_estimators

Define scale_pos_weight for imbalanced learning
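The usual heuristic for XGBoost's scale_pos_weight is the ratio of negative to positive samples. A sketch (the toy RESPONSE series reproduces the ~99:1 imbalance described above; the resulting value would be passed to XGBClassifier as scale_pos_weight):

```python
import pandas as pd

# Hypothetical RESPONSE column with the ~99:1 imbalance seen in the data.
y = pd.Series([0] * 990 + [1] * 10)

# Heuristic: (number of negative samples) / (number of positive samples).
counts = y.value_counts()
scale_pos_weight = counts[0] / counts[1]
print(scale_pos_weight)  # 99.0
```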

Step 2: Tune max_depth and min_child_weight

min_child_weight [default=1]

max_depth [default=6]

For further information see https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/

Step 3: Tune Gamma

gamma [default=0]

For further information see https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/

Step 4: Tune subsample and colsample_bytree

subsample [default=1]

colsample_bytree [default=1]

Step 5: Tuning Regularization Parameters

Step 7: Finalize learning_rate and Add More Trees

Try learning_rate=0.01

Try learning_rate=0.02

Try learning_rate=0.03

Try learning_rate=0.035

Try learning_rate=0.038

Try learning_rate=0.04

Try learning_rate=0.05

⇒ use learning_rate=0.038

Tune MLP

Neural nets perform better on scaled data.
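One way to guarantee the scaling is applied consistently is to wrap it in a pipeline with the classifier. A sketch on synthetic data (hyperparameters here are illustrative, not the tuned values):

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# Putting the scaler inside the pipeline keeps it within cross-validation
# folds and applies the same transformation at prediction time.
model = make_pipeline(StandardScaler(),
                      MLPClassifier(hidden_layer_sizes=(32,),
                                    max_iter=500, random_state=42))
model.fit(X, y)
print(model.score(X, y) > 0.5)  # True
```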

Tune Logistic Regression

Part 3: Kaggle Competition

Now that you've created a model to predict which individuals are most likely to respond to a mailout campaign, it's time to test that model in competition through Kaggle. If you click on the link here, you'll be taken to the competition page where, if you have a Kaggle account, you can enter. If you're one of the top performers, you may have the chance to be contacted by a hiring manager from Arvato or Bertelsmann for an interview!

Your entry to the competition should be a CSV file with two columns. The first column should be a copy of "LNR", which acts as an ID number for each individual in the "TEST" partition. The second column, "RESPONSE", should be some measure of how likely each individual became a customer – this might not be a straightforward probability. As you should have found in Part 2, there is a large output class imbalance, where most individuals did not respond to the mailout. Thus, predicting individual classes and using accuracy does not seem to be an appropriate performance evaluation method. Instead, the competition will be using AUC to evaluate performance. The exact values of the "RESPONSE" column do not matter as much: only that the higher values try to capture as many of the actual customers as possible, early in the ROC curve sweep.

Generate Kaggle Submissions
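A sketch of the submission file described in Part 3 (the LNR values and probabilities here are made up; in the notebook LNR comes from the TEST partition and the second column from the tuned model, e.g. `model.predict_proba(X_test)[:, 1]`):

```python
import numpy as np
import pandas as pd

# Hypothetical IDs and positive-class probabilities.
lnr = pd.Series([1754, 1770, 1465])
proba = np.array([0.012, 0.047, 0.003])

# Two columns: the LNR ID and a likelihood score named RESPONSE;
# since Kaggle scores with AUC, only the ranking of the values matters.
submission = pd.DataFrame({'LNR': lnr, 'RESPONSE': proba})
submission.to_csv('submission.csv', index=False)
print(list(submission.columns))  # ['LNR', 'RESPONSE']
```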

[Screenshots of the Kaggle submissions: Benchmark, Extra Trees, XGBoost, MLP, Logistic Regression]